Introduction to Open Data Science - Course Project

About the project

RStudio Exercise 1:

for course Introduction to Open Data Science 2021

Tasks and Instructions 1.(/5)

:white_check_mark:

Tasks and Instructions 2.(/5)

Feeling a bit excited about this course.

I heard about this course through University of Eastern Finland (internal communication) Yammer pages.

Link to Github repository here.

date()
## [1] "Mon Nov 29 08:35:08 2021"

RStudio Exercise 2

This week I did quite a few DataCamp exercises. So many, in fact, that I went on to next week's DataCamp exercises: DataCamp's grey-blue progress bar at the bottom showed five distinct chapters, and I did not realise that one bar corresponded to one week. This stressed me quite a bit, as there seemed to be no end to the DataCamp exercises.

Unfortunately I have not managed to start on the recommended reading material this week, even though I know I really should.

I started the exercises too late.

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Nov 29 08:35:08 2021"

1. Read the data

Read the data file to an object. Display the object (data table).

# Read the data
learning2014 <- read.table(
  file = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
  header = TRUE,
  sep = ",")

# check first few lines of the read data... table
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Looking at the table, a note on the earlier Data Wrangling part: attitude should have been scaled (divided by 10) as in the DataCamp exercises. The exercise instructions were unclear on this point (in the Data Wrangling part).

Check data dimensionality:

dim(learning2014)
## [1] 166   7

Check data structure:

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The dataset is from a questionnaire study, the “international survey of Approaches to Learning”. Check here.

“gender” is the sex/gender of the questionnaire participant. Data type is character (chr).
“age” is the age of the participant in years (presumably). Data type is integer (int).
“attitude” is the “Global attitude toward statistics” score divided by 10. Data type is numeric (num).
“deep” is the mean of the following original variables: “D03”, “D11”, “D19”, “D27”, “D07”, “D14”, “D22”, “D30”, “D06”, “D15”, “D23”, “D31”, summarizing the participant's deep learning. Data type is numeric (can hold decimal numbers).
“stra” is the mean of the following original variables: “ST01”, “ST09”, “ST17”, “ST25”, “ST04”, “ST12”, “ST20”, “ST28”, summarizing the participant's strategic learning. Data type is numeric (can hold decimal numbers).
“surf” is the mean of the following original variables: “SU02”, “SU10”, “SU18”, “SU26”, “SU05”, “SU13”, “SU21”, “SU29”, “SU08”, “SU16”, “SU24”, “SU32”, summarizing the participant's surface learning. Data type is numeric (can hold decimal numbers).
“points” are exam points. Data type is integer (int).
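The construction of these combination variables in the wrangling step can be sketched as follows. This is not the actual wrangling script; it assumes a hypothetical raw questionnaire data frame `lrn14` containing the original item columns (and an unscaled `Attitude` column) listed above:

```r
# Sketch of the wrangling step, assuming a raw data frame `lrn14`
# with the original Likert-scale item columns.
deep_questions <- c("D03","D11","D19","D27","D07","D14","D22","D30",
                    "D06","D15","D23","D31")

# each combination variable is the row-wise mean of its items
lrn14$deep <- rowMeans(lrn14[, deep_questions])

# attitude is the global attitude score divided by 10
lrn14$attitude <- lrn14$Attitude / 10
```

The same `rowMeans()` pattern applies to stra and surf with their respective item lists.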

2. Graphical overview, summaries

First we might need to install some packages to make some nice graphical overviews:

# uncomment if you need to install
#install.packages(c("GGally", "ggplot2"))

Let’s start plotting some scatter plot matrix from our data:

# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

As you can see, the (multi)plot above is quite busy, with lots of information to take in.
Let's start by noting the column names on top and the row names on the right side. There is no legend for gender, but the plot shows F and M: F for Female and M for Male. So this plot is color-coded by gender.

In the top-left corner (gender column, gender row) we can see the counts of each value: F in a pinkish color, M in cyan. The female participant count is almost double the male count. Let's check this:

table(learning2014$gender)
## 
##   F   M 
## 110  56

In the top row (to the right of gender) we have boxplots for each gender. A boxplot describes the spread/distribution of values: the box spans the interquartile range from Q1 (25th percentile) to Q3 (75th percentile), with a line at the median (Q2, the 50th percentile) in between. The circles mark possible outliers (under an assumption of roughly normal distribution, which does not always hold).
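The quantities behind a boxplot can be checked directly with `quantile()` and `IQR()`; a minimal sketch using the points variable (assuming learning2014 has been read as above):

```r
# quartiles that form the box of a boxplot, e.g. for exam points
quantile(learning2014$points, probs = c(0.25, 0.5, 0.75))

# interquartile range (the height of the box)
IQR(learning2014$points)

# the usual whisker rule flags values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q <- quantile(learning2014$points, probs = c(0.25, 0.75))
lo <- q[1] - 1.5 * IQR(learning2014$points)
hi <- q[2] + 1.5 * IQR(learning2014$points)
learning2014$points[learning2014$points < lo | learning2014$points > hi]
```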

On the left side (below gender), bar plots display the counts (x-axis) of each value of the row variable (y-axis), split by gender.

Scatter plots (plots with points) are column values (x-axis) plotted against row values (y-axis).

We also have pairwise correlation information, one variable against another. The most interesting of the correlations: points and attitude are positively correlated with each other, and deep learning and surface learning are negatively correlated with each other, but only for males. So the values of these variables are somehow intertwined or related to each other.
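These pairwise correlations can be reproduced outside the plot with `cor()`, overall and by gender (a sketch, assuming learning2014 as read above):

```r
# overall correlation between attitude and exam points
cor(learning2014$attitude, learning2014$points)

# deep vs. surf correlation separately for each gender,
# to check the male-only negative correlation seen in the plot
by(learning2014, learning2014$gender,
   function(d) cor(d$deep, d$surf))
```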

Print summaries:

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

For each numeric variable we get the minimum, maximum and mean. The median, 1st quartile and 3rd quartile are the values drawn in the boxplots (as previously explained). By definition, 50% of the values lie between the Q1 and Q3 values, regardless of the distribution.

3. Three variables as explanatory variables and fit a regression model

This is a multiple regression model, as there is more than one explanatory variable.

Let’s have deep, stra and surf as explanatory variables for target variable points.

# create multiple regression model
my_model <- lm(points ~ deep + stra + surf, data = learning2014)

# print out summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

This does not look too good. In multiple regression we are more interested in the adjusted R-squared value, which is quite low here, so the model does not really explain the variation in exam points. Let's change the model by switching one of the explanatory variables.

New model:

my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
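Beyond eyeballing the R-squared values, the two candidate models can be compared side by side, for example with AIC (a sketch, not part of the exercise instructions):

```r
# refit both candidate models with descriptive names
model_deep     <- lm(points ~ deep + stra + surf, data = learning2014)
model_attitude <- lm(points ~ attitude + stra + surf, data = learning2014)

# lower AIC indicates a better trade-off between fit and complexity
AIC(model_deep, model_attitude)
```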

4. i) Explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters) ii) Explain and interpret the multiple R squared of the model

i) Explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters)

An increase of 1 point in attitude is associated with an estimated increase of 3.395 exam points, adjusting for the other explanatory variables. 0.853 is the estimated effect of strategic learning on exam points, adjusting for attitude and surface learning. -0.586 is the estimated effect of surface learning on exam points, adjusting for attitude and strategic learning.

ii) Explain and interpret the multiple R squared of the model

Multiple R-squared is 0.2074 and adjusted R-squared is 0.1927, so roughly 20% of the variation in exam points can be explained by attitude, strategic learning and surface learning. Adjusted R-squared is the more appropriate measure here, as it penalizes the number of explanatory variables in the model.
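Both R-squared values can be computed by hand from the residuals, which makes their meaning concrete (a sketch, assuming my_model as fitted above):

```r
y   <- learning2014$points
rss <- sum(residuals(my_model)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares

r2 <- 1 - rss / tss                   # multiple R-squared

n <- length(y)                        # number of observations (166)
p <- 3                                # number of explanatory variables
1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared
```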

5. Plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

Let’s produce the plots: Residual vs Fitted values, Normal QQ-plot and Residuals vs Leverage

# create the model again just in case
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# draw diagnostic plots using the plot() function. Choose the plots Residual vs Fitted values, Normal QQ-plot and Residuals vs Leverage (which are 1,2 and 5)
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

Residuals vs Fitted: a reasonably random spread of points, so there shouldn't be a problem with the constant-variance assumption.

Normal Q-Q: in the middle the points fit the normality line quite well, but at both ends they deviate from the line. So the fit to the normality assumption is not entirely reasonable.

Residuals vs Leverage: this plot helps identify observations that have an unusually high impact. There are seemingly a few outlying values at the bottom of the plot, and these outliers may have a large influence on the model in question.
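The leverage values shown in the last plot can also be listed numerically; a sketch (the 2*(p+1)/n cutoff is one common rule of thumb, not part of the exercise):

```r
# hat values = leverage of each observation
lev <- hatvalues(my_model)

# a common rule of thumb flags leverage above 2*(p+1)/n as high
p <- 3
n <- nrow(learning2014)
which(lev > 2 * (p + 1) / n)

# Cook's distance combines leverage with residual size
head(sort(cooks.distance(my_model), decreasing = TRUE))
```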


RStudio Exercise 3

2. Read data

alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", header = TRUE, sep = ",")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

The original data set description can be found here. The data read here is a combination of two data sets on student performance, one in Mathematics and one in Portuguese language. So there were originally two data sets with the same variables relating to student grades and demographic, social and school-related features.

Variables/column names with suffix .p are originals from the por (Portuguese language) dataset and those with .m from the mat (Mathematics) dataset. In the combined dataset, “failures”, “paid” (first value selected), “absences”, “G1”, “G2” and “G3” are averaged from the mat and por datasets. “alc_use” is the average of “Dalc” - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) and “Walc” - weekend alcohol consumption (same scale). “high_use” is TRUE if “alc_use” is higher than 2 and FALSE otherwise.
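If the description above holds, the derived alcohol-use columns can be reconstructed directly from the raw columns (a sketch, using the alc data read earlier; the *_check names are just for comparison):

```r
# average of workday and weekend alcohol consumption
alc_use_check <- (alc$Dalc + alc$Walc) / 2

# logical indicator for high consumption
high_use_check <- alc_use_check > 2

# these should match the columns already present in the data
all(alc_use_check == alc$alc_use)
all(high_use_check == alc$high_use)
```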

3. Choosing 4 interesting variables to study relationships with high/low alcohol consumption

higher - wants to take higher education. I hypothesize that students not wanting to pursue higher education have higher alcohol consumption.

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I think family relationships might be highly correlated with alcohol consumption: students with bad family relationships would likely drink more, and vice versa.

age - my hypothesis is that older students will likely have higher alcohol consumption; as the age range is from 15 to 22, persons over 18 can buy alcohol legally.

failures - number of past class failures (numeric: n if 1<=n<3, else 4). I hypothesize that a high number of failures correlates with high alcohol consumption.

4. Exploring distributions of previously chosen variables

Check our variable types etc.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
my_vars <- c("high_use", "higher", "age", "famrel", "failures")
interest_alc <- subset(alc, select = my_vars)

glimpse(interest_alc)
## Rows: 370
## Columns: 5
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, F~
## $ higher   <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"~
## $ age      <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
## $ famrel   <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3, 4~
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~

Let’s examine the count distributions of our selected variables “higher”, “age”, “famrel”, “failures” and “high_use”.

library(tidyr); library(ggplot2)

gather(interest_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

higher

Interesting - it looks like most students do want to take higher education. Let’s check the actual counts and plot them.

alc %>% group_by(higher) %>% summarise(count = n())
## # A tibble: 2 x 2
##   higher count
##   <chr>  <int>
## 1 no        16
## 2 yes      354
alc %>% group_by(high_use, higher) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   high_use [2]
##   high_use higher count
##   <lgl>    <chr>  <int>
## 1 FALSE    no         7
## 2 FALSE    yes      252
## 3 TRUE     no         9
## 4 TRUE     yes      102
alc %>% group_by(high_use, higher) %>% summarise(count = n()) %>%
   ggplot(aes(x = higher, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

I am somewhat surprised that the majority of the students want to continue to higher education; overall there are very few who do not. It looks like my hypothesis does not hold.

famrel

Summarize counts of high_use by famrel in table and in barplot.

alc %>% group_by(high_use, famrel) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups:   high_use [2]
##    high_use famrel count
##    <lgl>     <int> <int>
##  1 FALSE         1     6
##  2 FALSE         2     9
##  3 FALSE         3    39
##  4 FALSE         4   128
##  5 FALSE         5    77
##  6 TRUE          1     2
##  7 TRUE          2     9
##  8 TRUE          3    25
##  9 TRUE          4    52
## 10 TRUE          5    23
alc %>% group_by(high_use, famrel) %>% summarise(count = n()) %>%
   ggplot(aes(x = famrel, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent): bad family relations get a low number, good ones a high number. My hypothesis looks rather poor here: the counts for high_use combined with bad family relationships are very low. Overall, bad family relationships seem to be rare among the students.

failures

Summarize counts of high_use by failures in table and in barplot.

alc %>% group_by(high_use, failures) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 8 x 3
## # Groups:   high_use [2]
##   high_use failures count
##   <lgl>       <int> <int>
## 1 FALSE           0   238
## 2 FALSE           1    12
## 3 FALSE           2     8
## 4 FALSE           3     1
## 5 TRUE            0    87
## 6 TRUE            1    12
## 7 TRUE            2     9
## 8 TRUE            3     3
alc %>% group_by(high_use, failures) %>% summarise(count = n()) %>%
   ggplot(aes(x = failures, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

I hypothesized that a high number of failures would be related to high alcohol consumption. There does not seem to be much difference between the groups: overall failure counts are very low, and the spread of one or more failures is very similar in the high-consumption and non-high-consumption groups.

age

alc %>% group_by(high_use) %>% summarise(mean = mean(age))
## # A tibble: 2 x 2
##   high_use  mean
##   <lgl>    <dbl>
## 1 FALSE     16.5
## 2 TRUE      16.8
# initialise a plot of high_use and age
g2 <- ggplot(alc, aes(x = high_use, y = age))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student age by alcohol consumption")

# let's also see the same with sex added
g3 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g3 + geom_boxplot() + ggtitle("Student age by alcohol consumption and sex")

alc %>% group_by(high_use, age) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   high_use [2]
##    high_use   age count
##    <lgl>    <int> <int>
##  1 FALSE       15    63
##  2 FALSE       16    74
##  3 FALSE       17    62
##  4 FALSE       18    52
##  5 FALSE       19     7
##  6 FALSE       20     1
##  7 TRUE        15    18
##  8 TRUE        16    28
##  9 TRUE        17    35
## 10 TRUE        18    25
## 11 TRUE        19     4
## 12 TRUE        22     1
alc %>% group_by(high_use, age) %>% summarise(count = n()) %>%
   ggplot(aes(x = age, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

The average age for high use is slightly higher, but high-alcohol-consumption students do not seem to be clearly older, and in particular are not concentrated at age 18 or over.

Overall, it looks like I chose variables that are in a very weak or nonexistent relation to high alcohol consumption.

5. Logistic regression with chosen variables

We’ll perform logistic regression to explore the relationships between the famrel, age, higher and failures variables and high/low alcohol consumption as the target variable.

log_reg_model <- glm(high_use ~ famrel + age + higher + failures, data = alc, family = "binomial")
summary(log_reg_model)
## 
## Call:
## glm(formula = high_use ~ famrel + age + higher + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6954  -0.8036  -0.7110   1.3402   1.8570  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.7777     1.9341  -0.919   0.3580   
## famrel       -0.2815     0.1254  -2.245   0.0248 * 
## age           0.1405     0.1029   1.365   0.1723   
## higheryes    -0.4499     0.5899  -0.763   0.4457   
## failures      0.5594     0.2102   2.661   0.0078 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 432.08  on 365  degrees of freedom
## AIC: 442.08
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR       2.5 %    97.5 %
## (Intercept) 0.1690283 0.003738272 7.4864947
## famrel      0.7546270 0.589068018 0.9650944
## age         1.1508478 0.941172724 1.4104114
## higheryes   0.6376962 0.198545179 2.0808162
## failures    1.7495838 1.162299822 2.6710151

This seems a bit limited, since we can’t see the other levels of famrel. Let’s treat famrel and higher as factors.

log_reg_model <- glm(high_use ~ factor(famrel) + age + factor(higher) + failures, data = alc, family = "binomial")
summary(log_reg_model)
## 
## Call:
## glm(formula = high_use ~ factor(famrel) + age + factor(higher) + 
##     failures, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7501  -0.8559  -0.7101   1.2656   1.8934  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        -3.1286     2.0803  -1.504  0.13260   
## factor(famrel)2     0.8850     0.9551   0.927  0.35411   
## factor(famrel)3     0.5687     0.8622   0.660  0.50949   
## factor(famrel)4     0.1298     0.8381   0.155  0.87693   
## factor(famrel)5    -0.2315     0.8574  -0.270  0.78714   
## age                 0.1445     0.1036   1.395  0.16310   
## factor(higher)yes  -0.4172     0.5945  -0.702  0.48281   
## failures            0.5515     0.2122   2.599  0.00934 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.86  on 362  degrees of freedom
## AIC: 445.86
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                           OR        2.5 %    97.5 %
## (Intercept)       0.04377755 0.0006880176  2.482282
## factor(famrel)2   2.42308808 0.4105189123 20.100563
## factor(famrel)3   1.76605639 0.3673243645 12.821709
## factor(famrel)4   1.13859072 0.2499432346  7.999589
## factor(famrel)5   0.79333156 0.1665015354  5.717950
## age               1.15541978 0.9437919758  1.417882
## factor(higher)yes 0.65888088 0.2035056764  2.170862
## failures          1.73582033 1.1481286518  2.658353
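The odds ratios are simply exponentiated coefficients, so a value in the OR table can be checked by hand (a sketch, assuming log_reg_model as fitted above):

```r
# OR for failures: each additional past class failure multiplies the
# odds of high alcohol use by exp(coefficient)
exp(coef(log_reg_model)["failures"])

# equivalently, straight from the summary estimate
exp(0.5515)   # about 1.736, matching the OR table above
```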

Let’s try the model without an intercept.

log_reg_model <- glm(high_use ~ factor(famrel) + age + factor(higher) + failures -1, data = alc, family = "binomial")
summary(log_reg_model)
## 
## Call:
## glm(formula = high_use ~ factor(famrel) + age + factor(higher) + 
##     failures - 1, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7501  -0.8559  -0.7101   1.2656   1.8934  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## factor(famrel)1    -3.1286     2.0803  -1.504  0.13260   
## factor(famrel)2    -2.2436     1.9831  -1.131  0.25792   
## factor(famrel)3    -2.5599     1.9126  -1.338  0.18075   
## factor(famrel)4    -2.9988     1.9279  -1.556  0.11982   
## factor(famrel)5    -3.3601     1.9482  -1.725  0.08458 . 
## age                 0.1445     0.1036   1.395  0.16310   
## factor(higher)yes  -0.4172     0.5945  -0.702  0.48281   
## failures            0.5515     0.2122   2.599  0.00934 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 429.86  on 362  degrees of freedom
## AIC: 445.86
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                           OR        2.5 %   97.5 %
## factor(famrel)1   0.04377755 0.0006880176 2.482282
## factor(famrel)2   0.10607686 0.0021202344 5.145847
## factor(famrel)3   0.07731362 0.0017718458 3.260660
## factor(famrel)4   0.04984471 0.0011021640 2.152387
## factor(famrel)5   0.03473011 0.0007347747 1.554473
## age               1.15541978 0.9437919758 1.417882
## factor(higher)yes 0.65888088 0.2035056764 2.170862
## failures          1.73582033 1.1481286518 2.658353

RStudio Exercise 4

2. Loading data (Boston)

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# Structure and dimensions of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Original data description can be found here.

The Boston data frame contains housing values in suburbs of Boston; it has 506 rows and 14 columns.

crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.

3. Graphical overview and summaries of the data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Plot matrix of variables

# chunk options: fig1, fig.height = 19, fig.width = 17
library(ggplot2)
library(GGally)

# remove chas variable which is categorical
Boston_cont <- subset(Boston, select = -chas)

ggpairs(Boston_cont,
        upper = list(continuous = wrap('cor', size = 4)),
        title = "Scatterplot matrix of `Boston`")

Very few of the variables seem to be truly normally distributed; rm (average number of rooms per dwelling) seems to be the closest.

Let’s take a more graphical look at the correlations of the variables:

# chunk options: fig.height = 12, fig.width = 8
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

In the correlation matrix visualization above, red symbolizes negative correlation and blue positive correlation; the larger the circle, the stronger the correlation.

A few picks from the correlation matrix:

crim “per capita crime rate by town” is most positively correlated with rad “index of accessibility to radial highways”.

indus “proportion of non-retail business acres per town” is negatively correlated with dis “weighted mean of distances to five Boston employment centres”. So… a bigger proportion of non-retail business acres per town is situated away from Boston’s five employment centres; in other words, industry areas are situated away from the centres of the towns.
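As a side note, the strongest pairwise correlations can also be picked out programmatically instead of scanning the matrix by eye. A minimal sketch of the idea, illustrated here with R’s built-in mtcars data so it runs without the MASS package:

```r
# correlation matrix of the built-in mtcars data (stands in for Boston here)
cm <- cor(mtcars)

# keep only the upper triangle so each variable pair appears once
cm[lower.tri(cm, diag = TRUE)] <- NA

# flatten to a data frame of pairs and sort by absolute correlation
pairs_df <- as.data.frame(as.table(cm))
pairs_df <- pairs_df[!is.na(pairs_df$Freq), ]
pairs_df <- pairs_df[order(-abs(pairs_df$Freq)), ]

# the few strongest pairs
head(pairs_df, 3)
```

Running the same on Boston would surface the rad–tax pair (0.91) at the top of the list.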

4. Standardize dataset

Standardize and scale the variables in the dataset:

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

The standardized and scaled variables now have negative values, whereas before there were only positive values. The values are now in “the same ballpark” for all the variables.
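For the record, scale() does nothing more mysterious than subtracting the column mean and dividing by the column standard deviation; a quick sanity check on a toy vector:

```r
x <- c(2, 4, 6, 8)

# manual standardization: (x - mean) / sd
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; compare its values to the manual version
all.equal(as.numeric(scale(x)), manual)
## [1] TRUE
```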

Here we create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate) using quantiles (and renaming them “low”, “med_low”, “med_high”, “high”) and we drop the original crime rate variable out:

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Here we divide the dataset into a training set (80% of the data) and a test set (20% of the data):

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

5. Fitting the linear discriminant analysis on the train set

Here we fit the linear discriminant analysis (LDA) on the training set. The previously modified categorical crime rate (“low”, “med_low”, “med_high”, “high”) is used as target variable and everything else is used as predictors.

# linear discriminant analysis (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##      low  med_low med_high     high 
## 0.240099 0.240099 0.240099 0.279703 
## 
## Group means:
##                  zn      indus       chas        nox         rm        age
## low       0.9920210 -0.9214120 -0.1099744 -0.8712103  0.4593058 -0.8527438
## med_low  -0.1225635 -0.2367969 -0.1505631 -0.5738829 -0.1008532 -0.3406651
## med_high -0.3811524  0.1826759  0.2147349  0.4528041  0.1307991  0.4605612
## high     -0.4872402  1.0169038 -0.0284379  1.0824245 -0.4095268  0.8215602
##                 dis        rad        tax     ptratio       black        lstat
## low       0.8887584 -0.6977144 -0.7315225 -0.45755661  0.37934846 -0.784135255
## med_low   0.3548229 -0.5497161 -0.4287962 -0.02517508  0.30672550 -0.179183177
## med_high -0.4414763 -0.4443413 -0.3434041 -0.37041364  0.08745074 -0.007286356
## high     -0.8570994  1.6402924  1.5156033  0.78329631 -0.75354640  0.893094775
##                 medv
## low       0.50679084
## med_low  -0.01556094
## med_high  0.19147420
## high     -0.69462719
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08465821  0.60754886 -1.05093990
## indus    0.08848298 -0.21003107  0.46219135
## chas    -0.15129083 -0.02066881 -0.03157672
## nox      0.39695234 -0.83915644 -1.18583836
## rm      -0.15902173 -0.11847904 -0.13193557
## age      0.13639700 -0.34441285 -0.16960301
## dis     -0.04709805 -0.17599004  0.32499222
## rad      3.73140684  0.76497083 -0.25858193
## tax      0.01956752  0.22263850  0.63217013
## ptratio  0.09805235  0.01976243 -0.21368667
## black   -0.09211349  0.04738973  0.09512138
## lstat    0.20540449 -0.22479604  0.34879582
## medv     0.22875849 -0.40887132 -0.02199487
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9588 0.0300 0.0111

Let’s draw the LDA (bi)plot:

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

It seems that a high crime rate is best predicted by rad, the index of accessibility to radial highways. zn, the proportion of residential land zoned for lots over 25,000 sq.ft., seems to best predict low crime rates. So a deficiency of small houses predicts a lower crime rate…

nox, the nitrogen oxides concentration (parts per 10 million), seems to predict med_high to high crime rates as well.

6. Predicting classes with the LDA model on the test data

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18      10        2    0
##   med_low    7      16        6    0
##   med_high   0      12       14    3
##   high       0       0        0   14

High crime rates especially seem to be predicted correctly by the model fitted on the training data. Low and med_low are also predicted correctly in most cases. The weakest predictions are for the middle categories, med_low and med_high. Note that the random sampling (where we choose 80% for training) might affect how well med_low and med_high are predicted, as their prediction is not as “easy” as for the low and high crime rates. The more cohesive the cluster of a target class (for example in the previous plot), the easier it is to predict.
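As a single rough figure of merit, overall accuracy is the sum of the diagonal of the cross table divided by the number of test observations. A sketch using the counts from the table above (the exact numbers vary from run to run because of the random 80/20 split):

```r
# confusion matrix counts copied from the cross tabulation above
conf <- matrix(c(18, 10,  2,  0,
                  7, 16,  6,  0,
                  0, 12, 14,  3,
                  0,  0,  0, 14),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correct predictions / all predictions
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
## [1] 0.6078431
```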

7. Standardizing data with comparable distances… and k-means

Here we re-standardize the data so that the distances between observations are comparable, and then calculate those distances.

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
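By default dist() uses the Euclidean metric, i.e. the square root of the summed squared differences between two observation rows, which is also why the scaling above matters. A quick check on two toy rows:

```r
# two observations in three dimensions
m <- rbind(c(0, 0, 0), c(3, 4, 0))

# manual Euclidean distance between the two rows
manual <- sqrt(sum((m[1, ] - m[2, ])^2))

# dist() gives the same value
c(manual, as.numeric(dist(m)))
## [1] 5 5
```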

It is somewhat unclear whether to explore the k-means algorithm with the SCALED data… but here we go:


km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)


set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

With two clusters you can get pretty decent separation between most of the variables.
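The elbow in the TWCSS curve can also be located with a crude programmatic heuristic: pick the k after which the drop in TWCSS levels off, i.e. the k reached by the single largest drop. A sketch on toy data with two obvious clusters (this simple rule favours small k and is no substitute for looking at the plot):

```r
set.seed(123)

# toy data: two well-separated clusters of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# total within-cluster sum of squares for k = 1..5
twcss <- sapply(1:5, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# successive drops in TWCSS; the largest drop marks the elbow
drops <- -diff(twcss)
which.max(drops) + 1
## [1] 2
```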

Bonus

Super-bonus

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

(more chapters to be added similarly as we proceed with the course!)